Don't store keyword multi fields when they trip ignore_above #132962

Kubik42 · 2025-08-15T02:20:45Z

This is a small refactor + bug fix for 131282.

The refactor changes how text, match_only_text, and annotated_text fields use keyword multi fields for synthetic source. Currently, this is done via the hasSyntheticSourceCompatibleKeywordField argument, where we set a boolean flag to indicate whether there is a keyword multi field that is either stored or has doc values. This is not a good approach for addressing 131282 because we want to disable the following logic for multi fields. With that disabled, the parent fields will no longer have a multi field to use for synthetic source.

We could designate one of the keyword fields as some kind of "synthetic source provider" for the parent. This way the field will always create a StoredField when ignore_above is tripped. However, this is a poor approach since it exposes how text fields are implemented to the keyword field. If the parent field decides how and what is stored, it'll be a lot clearer in the code.

This is where this PR comes in. It aims to remove hasSyntheticSourceCompatibleKeywordField (although kept for now for bwc) and instead relies on the syntheticSourceDelegate. With the addition of a new method canUseSyntheticSourceDelegateForSyntheticSource(), which is called during indexing, we can determine whether a particular keyword multi field is a valid supporter of synthetic source. If it isn't, then the parent field will explicitly create a StoredField for that.
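As a rough illustration of that decision, here is a minimal sketch. The types `KeywordDelegate` and `ParentTextField` are hypothetical stand-ins and not the actual Elasticsearch classes; only the method name `canUseSyntheticSourceDelegateForSyntheticSource` comes from this PR, and its signature here is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a keyword multi field with an ignore_above limit.
final class KeywordDelegate {
    private final int ignoreAbove;
    private final boolean storedOrHasDocValues;

    KeywordDelegate(int ignoreAbove, boolean storedOrHasDocValues) {
        this.ignoreAbove = ignoreAbove;
        this.storedOrHasDocValues = storedOrHasDocValues;
    }

    // True when the keyword field itself retains the value: it is stored or has
    // doc values, and the value is not longer than ignore_above.
    boolean retains(String value) {
        return storedOrHasDocValues && value.length() <= ignoreAbove;
    }
}

// Hypothetical stand-in for the parent text field.
final class ParentTextField {
    private final KeywordDelegate delegate; // null when there is no keyword multi field
    final List<String> storedByParent = new ArrayList<>();

    ParentTextField(KeywordDelegate delegate) {
        this.delegate = delegate;
    }

    // Assumed signature: asked per value during indexing.
    boolean canUseSyntheticSourceDelegateForSyntheticSource(String value) {
        return delegate != null && delegate.retains(value);
    }

    // The parent decides what is stored: it only keeps its own copy when the
    // keyword multi field cannot supply the value for synthetic source.
    void index(String value) {
        if (canUseSyntheticSourceDelegateForSyntheticSource(value) == false) {
            storedByParent.add(value);
        }
    }
}
```

In this sketch the keyword field never needs to know that a text field depends on it, which is the separation the description argues for.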

Note: there are a lot of changed files, but most of them are just constructor changes. The actual changes are pretty limited.

Kubik42 · 2025-08-15T19:26:47Z

Accidentally nuked the previous PR while messing with branches. I did address all of the comments besides this one.

Kubik42 · 2025-08-22T22:53:24Z

...per-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java

-                storedFieldInBinaryFormat
+                isWithinMultiField,
+                storedFieldInBinaryFormat,
+                TextFieldMapper.SyntheticSourceHelper.syntheticSourceDelegate(getFieldType(), multiFields)


This was somewhat of a bug: the syntheticSourceDelegate was missing despite MatchOnlyTextFieldMapper using it indirectly. With this change, we're now passing the delegate directly to TextFieldMapper.

Kubik42 · 2025-08-22T22:57:13Z

...per-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java

                    "Field [" + name() + "] of type [" + CONTENT_TYPE + "] cannot run positional queries since [_source] is disabled."
                );
            }
-            if (searchExecutionContext.isSourceSynthetic() && withinMultiField) {


This block of code was broken down into three smaller functions for readability:

parentFieldFetcher

delegateFieldFetcher

sourceFieldFetcher

Kubik42 · 2025-08-22T23:16:04Z

server/src/main/java/org/elasticsearch/index/mapper/CompositeSyntheticFieldLoader.java

+    /**
+     * Returns a new {@link CompositeSyntheticFieldLoader} that merges this field loader with the given one.
+     */
+    public CompositeSyntheticFieldLoader mergedWith(CompositeSyntheticFieldLoader other) {


This is needed to merge two field loaders: one for loading values stored by the parent field (when ignore_above is tripped), and the second for loading values stored by the keyword multi field (when ignore_above isn't tripped). Since the keyword multi field already produces a CompositeSyntheticFieldLoader, I'm just extending that class.
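A minimal sketch of the merging idea, assuming a much simpler loader than the real CompositeSyntheticFieldLoader. `Layer` and `SimpleCompositeLoader` are hypothetical names used only for illustration; the real class works with Lucene readers rather than in-memory lists.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical layer: returns whatever values it holds for a document.
interface Layer {
    List<String> load(int docId);
}

// Hypothetical, simplified composite loader made of several layers.
final class SimpleCompositeLoader {
    private final List<Layer> layers;

    SimpleCompositeLoader(List<Layer> layers) {
        this.layers = List.copyOf(layers);
    }

    // Returns a new loader containing this loader's layers followed by the other's.
    // Neither input is modified.
    SimpleCompositeLoader mergedWith(SimpleCompositeLoader other) {
        List<Layer> merged = new ArrayList<>(layers);
        merged.addAll(other.layers);
        return new SimpleCompositeLoader(merged);
    }

    // Collects values from every layer, in layer order.
    List<String> load(int docId) {
        List<String> values = new ArrayList<>();
        for (Layer layer : layers) {
            values.addAll(layer.load(docId));
        }
        return values;
    }
}
```

With one loader for values stored by the parent and one for values kept by the keyword multi field, the merged loader returns a value for a document regardless of which side holds it.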

Kubik42 · 2025-08-22T23:27:12Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

            return new BlockSourceReader.BytesRefsBlockLoader(fetcher, sourceBlockLoaderLookup(blContext));
        }

+        public boolean isIgnoreAboveSet() {


Helper function for better readability

Kubik42 · 2025-09-18T23:25:50Z

...per-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java

        public MatchOnlyTextFieldMapper build(MapperBuilderContext context) {
            BuilderParams builderParams = builderParams(this, context);
            MatchOnlyTextFieldType tft = buildFieldType(context, builderParams.multiFields());
-            final boolean storeSource;


This logic has been replaced by the new logic in parseCreateField().

Kubik42 · 2025-09-18T23:31:02Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

-         * for synthetic source.
-         */
-        public String originalName() {
-            return originalName;


This has been moved up to TextFamilyFieldType, which KeywordFieldType now extends.

Kubik42 · 2025-09-18T23:32:39Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

-        private boolean normsDefault() {
-            if (indexCreatedVersion.onOrAfter(IndexVersions.DISABLE_NORMS_BY_DEFAULT_FOR_LOGSDB_AND_TSDB)) {
-                // don't enable norms by default if the index is LOGSDB or TSDB based
-                return indexMode != IndexMode.LOGSDB && indexMode != IndexMode.TIME_SERIES;
-            }
-            // bwc - historically, norms were enabled by default on text fields regardless of which index mode was used
-            return true;
-        }


Calling this outside of the constructor was bad practice, hence I moved it into the constructor. See lines 313-320 below.

elasticsearchmachine · 2025-09-19T19:28:02Z

Hi @Kubik42, I've created a changelog YAML for you.

jordan-powers

Overall, this looks good. I had a couple of nits, and a bwc question in the TextFieldMapper. This should also definitely be reviewed by Martijn.

...per-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java

server/src/main/java/org/elasticsearch/index/mapper/CompositeSyntheticFieldLoader.java

...rc/test/java/org/elasticsearch/index/mapper/annotatedtext/AnnotatedTextFieldMapperTests.java

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

jordan-powers · 2025-09-26T00:59:16Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

-            // _ignored_source field will contain entries for this field if it is not stored
-            // and there is no syntheticSourceDelegate.
-            // See #syntheticSourceSupport().
-            // But if a text field is a multi field it won't have an entry in _ignored_source.
-            // The parent might, but we don't have enough context here to figure this out.
-            // So we bail.
-            if (isSyntheticSource && syntheticSourceDelegate == null && parentField == null) {
-                return fallbackSyntheticSourceBlockLoader(blContext);
-            }
-
            SourceValueFetcher fetcher = SourceValueFetcher.toString(blContext.sourcePaths(name()));
            return new BlockSourceReader.BytesRefsBlockLoader(fetcher, blockReaderDisiLookup(blContext));
        }

-        FallbackSyntheticSourceBlockLoader fallbackSyntheticSourceBlockLoader(BlockLoaderContext blContext) {
-            var reader = new FallbackSyntheticSourceBlockLoader.SingleValueReader<BytesRef>(null) {
-                @Override
-                public void convertValue(Object value, List<BytesRef> accumulator) {
-                    if (value != null) {
-                        accumulator.add(new BytesRef(value.toString()));
-                    }
-                }
-
-                @Override
-                protected void parseNonNullValue(XContentParser parser, List<BytesRef> accumulator) throws IOException {
-                    var text = parser.textOrNull();
-
-                    if (text != null) {
-                        accumulator.add(new BytesRef(text));
-                    }
-                }
-
-                @Override
-                public void writeToBlock(List<BytesRef> values, BlockLoader.Builder blockBuilder) {
-                    var bytesRefBuilder = (BlockLoader.BytesRefBuilder) blockBuilder;
-
-                    for (var value : values) {
-                        bytesRefBuilder.appendBytesRef(value);
-                    }
-                }
-            };
-
-            return new FallbackSyntheticSourceBlockLoader(
-                reader,
-                name(),
-                IgnoredSourceFieldMapper.ignoredSourceFormat(blContext.indexSettings().getIndexVersionCreated())
-            ) {
-                @Override
-                public Builder builder(BlockFactory factory, int expectedCount) {
-                    return factory.bytesRefs(expectedCount);
-                }
-            };
-        }
-


Right, in the old implementation, sometimes the synthetic source support would fall back to storing values in ignored_source. This is no longer the case with the new implementation.

However, we need to consider BWC. There will be some indices that have had data written before this change and thus might have data stored in ignored_source.

We still need to support loading values from ignored source in this case.

I suspect the reason the bwc tests have not caught this is because using the FallbackSyntheticSourceBlockLoader is just an optimization. If we use a BlockSourceReader, it will still work. However, this means it will be constructing the entire _source of the document from the synthetic source, then parsing it to determine the value. Better to use the FallbackSyntheticSourceBlockLoader which knows how to extract only the individual field we're interested in out of the ignored source.

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

server/src/main/java/org/elasticsearch/index/IndexVersions.java

Kubik42 · 2025-09-30T04:07:49Z

server/src/main/resources/transport/upper_bounds/8.18.csv

@@ -1 +1 @@
-initial_elasticsearch_8_18_8,8840010


idk why these files are popping up, I'm pulling them directly from main. When I rebase before merging, I hope they'll disappear.

This is occurring because you are out of date with upstream main. If you sync the upstream branch (i.e. the "Update Branch" button in GitHub), the diff should disappear. But you can also ignore it for now; the transport version generation task is pulling in changes from upstream main.

This behavior isn't intentional; it's a bug in transport version generation, and should go away soon.

martijnvg

Thanks @Kubik42 - LGTM

martijnvg · 2025-10-02T12:21:30Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

-        private boolean normsDefault() {
-            if (indexCreatedVersion.onOrAfter(IndexVersions.DISABLE_NORMS_BY_DEFAULT_FOR_LOGSDB_AND_TSDB)) {
-                // don't enable norms by default if the index is LOGSDB or TSDB based
-                return indexMode != IndexMode.LOGSDB && indexMode != IndexMode.TIME_SERIES;
-            }
-            // bwc - historically, norms were enabled by default on text fields regardless of which index mode was used
-            return true;
-        }


As long as it accesses fields that have been initialized, the original method shouldn't be problematic. But I get what you're saying.
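To illustrate the general Java behavior behind this point, here is a small sketch. `NormsSettings` is a hypothetical class and not code from this PR; it only shows that an instance method called from a constructor sees default values for fields that have not been assigned yet.

```java
// Hypothetical class for illustration only.
final class NormsSettings {
    final boolean logsMode;
    final boolean normsEarly;
    final boolean normsLate;

    NormsSettings(boolean logsMode) {
        // logsMode has not been assigned yet, so normsDefault() reads its default (false).
        this.normsEarly = normsDefault();
        this.logsMode = logsMode;
        // Now the field is initialized and normsDefault() sees the real value.
        this.normsLate = normsDefault();
    }

    private boolean normsDefault() {
        return logsMode == false;
    }
}
```

For `new NormsSettings(true)`, `normsEarly` is `true` while `normsLate` is `false`, so such a helper is only safe when every field it reads has already been assigned.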

jordan-powers

LGTM, thanks Dmitry!

Kubik42 added >bug Team:StorageEngine labels Aug 15, 2025

elasticsearchmachine added the v9.2.0 label Aug 15, 2025

Kubik42 force-pushed the 131282-2 branch from 205f795 to ec91a34 Compare August 15, 2025 03:13

Kubik42 force-pushed the 131282-2 branch 6 times, most recently from 1f84854 to 53ded56 Compare August 22, 2025 21:33

Kubik42 marked this pull request as ready for review August 22, 2025 22:51

elasticsearchmachine added needs:triage Requires assignment of a team area label and removed Team:StorageEngine labels Aug 22, 2025

Kubik42 force-pushed the 131282-2 branch from 53ded56 to 485847f Compare August 22, 2025 22:57

Kubik42 force-pushed the 131282-2 branch from 485847f to 6aa57a9 Compare August 22, 2025 22:58

Kubik42 changed the title ~~Abstracted how Text fields use Keyword fields inside of Text fields~~ Don't store keyword multi fields when they trip ignore_above Aug 22, 2025

Kubik42 added the Team:StorageEngine label Aug 22, 2025

elasticsearchmachine removed the Team:StorageEngine label Aug 22, 2025

Kubik42 force-pushed the 131282-2 branch from b8f0584 to 6aa57a9 Compare August 22, 2025 23:25

Kubik42 added the Team:StorageEngine label Aug 22, 2025

elasticsearchmachine removed the Team:StorageEngine label Aug 22, 2025

Kubik42 added Team:StorageEngine and removed needs:triage Requires assignment of a team area label labels Aug 23, 2025

elasticsearchmachine added needs:triage Requires assignment of a team area label and removed Team:StorageEngine labels Aug 23, 2025

Kubik42 reopened this Sep 15, 2025

Kubik42 force-pushed the 131282-2 branch from 2c2e5b2 to aad9ccf Compare September 18, 2025 22:20

Kubik42 marked this pull request as ready for review September 19, 2025 19:27

jordan-powers reviewed Sep 26, 2025

elasticsearchmachine added v9.3.0 and removed v9.2.0 labels Oct 2, 2025

martijnvg approved these changes Oct 2, 2025

jordan-powers approved these changes Oct 2, 2025

Kubik42 and others added 11 commits October 2, 2025 13:51

Merged with main

8f5665f

Renamed Builder

d46ea26

Delete docs/changelog/134582.yaml

4225513

Cleaned up

d34a52a

Fixes

3b75e6c

Removed redundant tests

1f85335

Update docs/changelog/132962.yaml

84755a6

Addressed feedback - renamed some functions, converted all snake case test names to camel case

93f8609

Reverted the removal of fallback synthetic source in block loader

2349297

[CI] Auto commit changes from spotless

3503532

Fixed failing tests

d1078aa

Kubik42 force-pushed the 131282-2 branch from be37c1e to d1078aa Compare October 2, 2025 21:00

Kubik42 merged commit ede15bb into elastic:main Oct 2, 2025
34 checks passed

Kubik42 deleted the 131282-2 branch October 2, 2025 23:16

This was referenced Oct 21, 2025

Added a Builder for KeywordFieldType, removed redundant constructors #136846

Closed

Synthetic source: avoid storing copy_to text or match_only_text target field by default #129190

Closed

jordan-powers mentioned this pull request Oct 28, 2025

Delegate synthetic source to keyword multi-fields when skip_store_original_value #137229

Merged
